feat: automatically fall back to VAE tiling when an untiled decode exceeds the backend buffer limit #1621
RapidMark wants to merge 5 commits into
Conversation
VAE decode can fail on integrated / low-VRAM GPUs because the untiled
compute buffer exceeds the backend's maximum single-buffer allocation
(e.g. Vulkan maxBufferSize), even when total memory is plentiful. sd.cpp
already supports tiling that keeps each compute buffer small, but it had
to be requested up front with --vae-tiling; users hit a hard failure
instead of the working path that was one flag away.
Make --vae-tiling a tristate:
  off  - never tile (fail if the untiled buffer doesn't fit)
  on   - always tile (previous --vae-tiling behavior)
  auto - (default) try untiled; if the compute buffer can't be allocated,
         free it and retry once with tiling
Implemented by appending a `bool auto_tile` to sd_tiling_params_t (kept
at the end of the struct so the C ABI stays backward-compatible) and a
single fallback branch in VAE::decode. Bare `--vae-tiling` with no value
remains backward-compatible (= on). auto_tile round-trips through the
JSON gen-params load/save.
Validated on an AMD Radeon 8060S iGPU (Flux Krea Q4, 1024x1024, Vulkan):
--vae-tiling off fails at decode (8.5 GB buffer exceeds the device limit),
--vae-tiling auto logs the retry and completes by tiling, --vae-tiling on
tiles from the start.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
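The tristate described in this commit message can be sketched as a small resolver over the two booleans. This is only an illustration: `tiling_params_sketch`, `tiling_mode`, and `resolve_tiling_mode` are hypothetical names, and the real `sd_tiling_params_t` has other members that are omitted here.

```cpp
// Hypothetical stand-in for the two tiling fields discussed above.
// The real sd_tiling_params_t carries more members; they are left out.
struct tiling_params_sketch {
    bool enabled;    // pre-existing flag: force tiling on
    bool auto_tile;  // new field, appended last so existing member offsets do not move
};

enum class tiling_mode { off, on, automatic };

// `enabled` wins; otherwise `auto_tile` selects the fallback; otherwise never tile.
inline tiling_mode resolve_tiling_mode(const tiling_params_sketch& p) {
    if (p.enabled) {
        return tiling_mode::on;
    }
    return p.auto_tile ? tiling_mode::automatic : tiling_mode::off;
}
```

Keeping the new member last means code compiled against the older struct layout still reads `enabled` at the same offset, which is the compatibility argument the message makes.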
Wouldn't it be possible to check against the real value, calculated from the graph before the allocation?
…es review)

Reviewer (wbruna) asked why retry-on-failure rather than checking the real buffer size from the graph up front. Good point: ggml can plan the exact compute-buffer size with no allocation.

Add an opt-in probe to GGMLRunner: when set_probe_compute_buffer_fits(true), alloc_compute_buffer measures the planned size via ggml_gallocr_reserve_n_size (no_alloc planning, zero allocation) and, if it exceeds ggml_backend_buft_get_max_size(), returns false BEFORE the real reserve -- so the backend never emits its raw "allocation failed" error on the AUTO success path.

VAE::decode enables the probe only around the untiled _compute in AUTO mode; the reactive output.empty()->tile path stays as the backstop for a genuine runtime OOM (planned size fits the max, but the device is full). get_max_size() is SIZE_MAX on CPU, so this no-ops there.

Validated on an AMD Radeon 8060S iGPU (Krea Q4, 1024x1024): --vae-tiling auto now logs only the INFO "retrying with tiling" + completes (no allocation-failed spew); off still fails; on still tiles.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
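The probe decision in that commit reduces to comparing two numbers. A minimal sketch, assuming the planned size and the backend limit have already been obtained (the commit names `ggml_gallocr_reserve_n_size` and `ggml_backend_buft_get_max_size()` for that); `should_defer_to_tiling` is a hypothetical helper, not a function from the PR:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: decide whether an untiled decode should be skipped.
// planned_bytes: compute-buffer size planned from the graph, nothing allocated.
// max_bytes:     largest single buffer the backend reports it can allocate.
inline bool should_defer_to_tiling(std::size_t planned_bytes, std::size_t max_bytes) {
    if (max_bytes == 0) {
        return false;  // no usable limit reported: let the real allocation decide
    }
    return planned_bytes > max_bytes;
}
```

With `max_bytes == SIZE_MAX`, as the commit says is the case on CPU, the comparison can never be true, so the probe is a no-op there.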
Good call, done (pushed just now). Instead of retrying on failure, the AUTO path now measures the planned compute-buffer size up front with
I kept the original retry-on-empty as a backstop for a genuine runtime OOM (planned size fits the max, but the device is actually full). Net effect on the auto path: the backend no longer prints its raw "allocation failed" error, just an INFO line and the tiled decode.
Validated on an AMD Radeon 8060S iGPU (Krea Q4, 1024²):
I think having a fallback to VAE tiling is a very welcome addition, but I have some small issues with the user experience there. Modifying the syntax of
For example we could add a
Alternatively, set "auto" tiling as default and add something like a
…out with --no-vae-tiling-fallback

Addresses review (stduhpf): turning --vae-tiling into a tristate option that takes a value breaks previously-working command lines. Revert that: --vae-tiling stays the original boolean flag (force tiling on).

The auto fallback is now the default (auto_tile defaults true), and since it only tiles when an untiled decode would exceed the backend buffer limit, it is non-breaking and strictly safer for everyone. Add --no-vae-tiling-fallback to disable it (fail instead of tiling) for anyone who wants the old hard-fail behavior.

Validated on an AMD Radeon 8060S iGPU (Krea Q4, 1024^2): default auto-recovers (logs "retrying with tiling", exit 0); --no-vae-tiling-fallback fails (exit 1); --vae-tiling tiles from the start (exit 0).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Thanks — agreed, changing
Validated on an AMD Radeon 8060S iGPU (Krea Q4, 1024²): default auto-recovers (logs |
…hanges) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The proactive probe in alloc_compute_buffer() returns false on purpose when the untiled compute buffer exceeds the backend's max single-buffer size, so the VAE auto-tiling fallback can take over. Callers logged that deliberate deferral as an ERROR, so a successful tiled decode printed a misleading "alloc compute buffer failed" on every run. Gate the ERROR on a new compute_buffer_deferred_to_tiling flag so it only fires on a genuine allocation failure; the deferral path stays a DEBUG line. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
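The log gating described in that commit can be pictured as a mapping from the allocation outcome to a log level. A sketch only: `alloc_outcome`, `log_level`, and `level_for` are hypothetical names, and the flag merely mirrors the role of `compute_buffer_deferred_to_tiling`.

```cpp
enum class log_level { none, debug, error };

// Hypothetical summary of what alloc_compute_buffer reported.
struct alloc_outcome {
    bool ok;                  // buffer was allocated
    bool deferred_to_tiling;  // probe declined on purpose so tiling can take over
};

// A deliberate deferral is routine (DEBUG); only a genuine failure is an ERROR.
inline log_level level_for(const alloc_outcome& r) {
    if (r.ok) {
        return log_level::none;
    }
    return r.deferred_to_tiling ? log_level::debug : log_level::error;
}
```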
wbruna left a comment
These are mostly suggestions, but I'm marking as 'changes needed' anyway because Claude is listed as a co-author (see CONTRIBUTING.md).
// Tristate with `enabled`: enabled => ON (always tile); else auto_tile => AUTO (tile only when
// an untiled VAE compute buffer can't be allocated, e.g. it exceeds the backend's max buffer
// size on an iGPU); else OFF (never tile, fail if the untiled buffer doesn't fit). Default AUTO.
// Appended (rather than folded into an enum) to keep the struct ABI backward-compatible.
I think this kind of detailed comment would be fine if the file already had detailed comments everywhere else.
Also (just suggestions, I don't know what @leejet would prefer): maybe we could use an extra_tiling_args parameter instead of a separate flag, since it'd be more useful as a workaround or for testing?
And rather than a simple on/off switch, maybe we could receive a threshold override here? Say, -1 for disabling, 0 for auto, > 0 as the new limit. That way, the user could increase it if they know the device can handle it, decrease it to save VRAM for other reasons, etc (working around the Vulkan 1G limit would be an immediate use case, too).
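One way to read the threshold-override suggestion above, as a sketch under assumptions: the override is taken to be in bytes (the comment does not specify a unit), and `effective_limit` is a hypothetical name, not part of the PR.

```cpp
#include <cstddef>
#include <optional>

// Hypothetical mapping of the suggested override to an effective limit:
//   < 0  fallback disabled, so there is no limit to compare against
//   == 0 auto: use the limit the backend reports
//   > 0  user-supplied limit replaces the backend's value
inline std::optional<std::size_t> effective_limit(long long override_bytes,
                                                  std::size_t backend_max) {
    if (override_bytes < 0) {
        return std::nullopt;
    }
    if (override_bytes == 0) {
        return backend_max;
    }
    return static_cast<std::size_t>(override_bytes);
}
```

A caller would tile only when the result has a value and the planned size exceeds it, which covers both raising the limit on a device known to handle more and lowering it to save VRAM.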
// genuine runtime OOM (planned size <= max, but the device is full)
// is NOT caught here -- it still surfaces from the real reserve
// below, so the reactive fallback remains the backstop.
size_t max_size = ggml_backend_buft_get_max_size(buft);
I believe the size calculation is needlessly including the VAE weights here. By default I get:
[DEBUG] ggml_extend.hpp:1932 - vae: untiled compute buffer 2112.06 MB exceeds backend max single buffer 1024.00 MB; deferring to tiling
Fiddling with GGML_VK_SUBALLOCATION_BLOCK_SIZE env var, it shows:
[DEBUG] ggml_extend.hpp:1953 - vae compute buffer size: 1920.06 MB(VRAM)
But this may be kind of a moot point, at least on Vulkan: as far as I can tell by looking at the ggml code, the limit on Vulkan will by default be capped at 1G anyway (suballocation_block_size, which is the value reported by ggml_backend_buft_get_max_size; the more useful max_buffer_size doesn't seem to be accessible from the API 😕).
FWIW this is part of my sweeping memory-management / lazy-loading changes in #1470 as well.

Thanks @pwilkin, good to know. This one's deliberately narrow: it's an OOM safety net in the VAE decode path. When an untiled decode would exceed the backend's max buffer size, it automatically falls back to tiling instead of erroring out. So it should be complementary to the broader memory-management / lazy-loading work in #1470 rather than overlapping it. Happy to rebase on top of #1470, or defer to it entirely if you and leejet would rather land the consolidated approach, whichever keeps things cleanest.
@leejet, before I make any code changes here, I'd like your call on direction, since @wbruna raised two API options and deferred to your preference:
1. auto-fallback API. Currently a tristate on the VAE-tiling flag (off / on / auto, where auto tiles only when an untiled VAE compute buffer would exceed the backend's max single-buffer size). @wbruna suggested a threshold override instead
2. size measurement. @wbruna noted the proactive measurement may include the VAE weights alongside the compute buffer (~2112 MB vs ~1920 MB), though it's likely moot on Vulkan since
I'll hold the implementation until you pick a direction, then make the changes.
Actually, there is a way to access it:
diff --git a/src/core/ggml_extend.hpp b/src/core/ggml_extend.hpp
index 4ebbc0a..2c14551 100644
--- a/src/core/ggml_extend.hpp
+++ b/src/core/ggml_extend.hpp
@@ -1922,6 +1922,21 @@ protected:
             // genuine runtime OOM (planned size <= max, but the device is full)
             // is NOT caught here -- it still surfaces from the real reserve
             // below, so the reactive fallback remains the backstop.
+            if (sd_backend_is(runtime_backend, "Vulkan")) {
+                size_t max_size = 0;
+                for (int i = 0; i < ggml_graph_n_nodes(gf); ++i) {
+                    ggml_tensor* op = ggml_graph_node(gf, i);
+                    max_size        = std::max(ggml_nbytes(op), max_size);
+                    if (!ggml_backend_supports_op(runtime_backend, op)) {
+                        LOG_DEBUG("%s: untiled compute op size %.2f MB exceeds backend support; deferring to tiling",
+                                  get_desc().c_str(),
+                                  max_size / 1024.0 / 1024.0);
+                        compute_buffer_deferred_to_tiling = true;
+                        return false;
+                    }
+                }
+                LOG_DEBUG("%s: max op size = %.2f MB", get_desc().c_str(), max_size / 1024.0 / 1024.0);
+            } else {
             size_t max_size = ggml_backend_buft_get_max_size(buft);
             if (max_size > 0) {
                 ggml_gallocr* probe = ggml_gallocr_new(buft);
@@ -1937,6 +1952,7 @@ protected:
                     return false;
                 }
             }
+            }
         }
         compute_allocr = ggml_gallocr_new(buft);

It worked perfectly with my card's 4GiB limit. With an SDXL 1024x960 gen, I've got:
while the slightly smaller 960x960 worked without tiling, despite the graph as a whole getting much larger than 4GiB:
Of course, that would assume that "unsupported" means "too large"; but a truly unsupported operation would end up failing in the same way as before.
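The observation above, that a graph larger than 4GiB in total can still run untiled as long as no single op output is too large, can be illustrated with plain numbers. This sketch swaps in an explicit size comparison where the diff asks the backend via `ggml_backend_supports_op`; the function names and the per-op byte list are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Largest single op output in the graph, in bytes (0 for an empty graph).
inline std::uint64_t largest_op_bytes(const std::vector<std::uint64_t>& op_bytes) {
    std::uint64_t largest = 0;
    for (std::uint64_t b : op_bytes) {
        largest = std::max(largest, b);
    }
    return largest;
}

// True when at least one op output would not fit in a single buffer.
// The sum of all op sizes is deliberately ignored.
inline bool any_op_exceeds(const std::vector<std::uint64_t>& op_bytes,
                           std::uint64_t per_buffer_limit) {
    return largest_op_bytes(op_bytes) > per_buffer_limit;
}
```

Under this reading, three ops of 3, 3, and 1 GiB pass a 4 GiB per-buffer limit even though they total 7 GiB, while a single 5 GiB op does not.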
On memory-constrained backends (integrated GPUs especially), a full-image VAE decode needs a single compute buffer larger than the backend's maximum single-buffer allocation size, and sd.cpp hard-fails instead of falling back to the tiling it already supports. The user has to know to pass `--vae-tiling` up front; otherwise the run crashes at the very end, after sampling has already completed.
Repro
AMD Radeon 8060S (Strix Halo, RDNA3.5 iGPU, 128 GB unified memory), Vulkan backend, Flux Krea-dev Q4 at 1024×1024, with no tiling flag:
The ~8.5 GB single-shot VAE decode buffer exceeds the iGPU's Vulkan per-buffer limit. The card has ample total memory (it shares 128 GB system RAM) — the failure is the per-buffer ceiling, not capacity. The whole gen is lost after a successful sampling pass.
Change
Add an automatic fallback to tiling, on by default, and keep it non-breaking:
- `--vae-tiling` stays exactly as it was: a boolean flag that forces tiling on.
- By default, the planned untiled compute-buffer size is measured with `ggml_gallocr_reserve_n_size` (no-alloc planning, zero allocation) and compared against `ggml_backend_buft_get_max_size()`; if it won't fit, the decode goes straight to tiling. This is non-breaking (a decode that previously fit behaves identically, and one that previously OOM'd now recovers) and strictly safer. On CPU, `get_max_size()` is `SIZE_MAX`, so it no-ops there.
- `--no-vae-tiling-fallback` disables the fallback for anyone who wants the old hard-fail behavior.
- If `_compute` still returns empty at runtime (e.g. the planned size fit the max but the device is genuinely full), it frees the buffer and retries once tiled, so a true OOM is also covered.

Implemented with a `bool auto_tile` appended to the end of `sd_tiling_params_t` (kept at the end so the C ABI stays backward-compatible; default `true`), the proactive probe in `GGMLRunner::alloc_compute_buffer`, and the fallback branch in `VAE::decode`.

Choosing the real graph-planned size (not a hardcoded bytes-per-pixel estimate) keeps it correct across every VAE architecture (SD/SDXL/Flux/Wan/LTX) and backend with no tuning.
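The decode flow described in this section can be sketched with the two decode paths passed in as callables. This is an illustration under assumptions, not the PR's code: `decode_options`, `decode_fn`, and `decode_with_fallback` are hypothetical, an empty `std::optional` stands in for `_compute` returning empty, and the probe result is taken as a precomputed boolean.

```cpp
#include <functional>
#include <optional>
#include <string>

struct decode_options {
    bool force_tiling;   // --vae-tiling
    bool auto_fallback;  // on by default; off with --no-vae-tiling-fallback
};

// A decode attempt yields an image (a string here) or nothing on failure.
using decode_fn = std::function<std::optional<std::string>()>;

inline std::optional<std::string> decode_with_fallback(const decode_options& opt,
                                                       bool probe_too_large,
                                                       const decode_fn& untiled,
                                                       const decode_fn& tiled) {
    if (opt.force_tiling) {
        return tiled();  // forced: never attempt the untiled path
    }
    if (opt.auto_fallback && probe_too_large) {
        return tiled();  // proactive: planned buffer exceeds the backend limit
    }
    std::optional<std::string> out = untiled();
    if (!out && opt.auto_fallback) {
        return tiled();  // reactive backstop: retry once, tiled
    }
    return out;  // success, or hard failure when the fallback is disabled
}
```

The reactive branch retries exactly once, so a tiled decode that also fails still surfaces as a failure instead of looping.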
Validation (AMD Radeon 8060S iGPU, Krea Q4, 1024²)
- default → logs `vae: untiled decode buffer exceeded the backend limit; retrying with tiling`, completes, exit 0
- `--no-vae-tiling-fallback` → fails at decode, exit 1 (the old behavior, opt-in)
- `--vae-tiling` → tiles from the start, exit 0

The tiled GPU decode (~6.9 s) is also far faster than the usual workaround of routing the VAE to CPU (~29.5 s) to dodge the OOM, and is visually equivalent at 0.5 tile overlap.
Helps any constrained device, not just iGPUs: an 8 GB discrete card at high resolution hits the same per-buffer wall. Scoped to `decode` (where the failure occurs); `encode` has the same shape and could get the identical treatment later.

Thanks to @wbruna for pushing toward the proactive graph-planned size, and @stduhpf for catching that the original tristate would have broken the `--vae-tiling` syntax (this revision keeps it a plain flag + auto-by-default + opt-out).